Exploratory Analysis of the variants

Removing outliers

Variants distribution

Kmeans

Correlation matrices

Correlation matrix for all individuals

P-value matrix for all individuals

Correlation matrix for target individuals

Correlation matrix for control individuals

Distribution for each variant between affected and control

Distribution for each variant for each kmeans cluster

Comparison analysis

Loss-of-function grouping

Number of LoF         Affected n  Control n  Affected prop.  Control prop.  Fisher p
Grouping Method: Frequency
  Rare variants
    [25,48)           20          66         0.417           0.555          0.125
    [48,92]           28          53         0.583           0.445
  Singletons
    [0, 8)            24          67         0.500           0.563          0.495
    [8,17]            24          52         0.500           0.437
Grouping Method: Presence versus Absence
  Rare variants
    0                 0           0          0.000           0.000          1
    [ 1,92]           48          119        1.000           1.000
  Singletons
    0                 0           1          0.000           0.008
    [ 1,92]           48          118        1.000           0.992
Grouping Method: Kmeans
  Rare variants
    [25.0,52.6)       29          87         0.604           0.731          0.137
    [52.6,92.0]       19          32         0.396           0.269
  Singletons
    [ 0.00, 8.47)     29          87         0.604           0.731
    [ 8.47,17.00]     19          32         0.396           0.269
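
The proportions and the Fisher column in the grouping table can be recomputed from the counts alone. The following is a minimal sketch in base R, assuming the reported p-value comes from a two-sided Fisher's exact test on the 2x2 count table (the report does not show the call). It uses the counts of the "Frequency / Rare variants" rows; the object names are illustrative.

```r
# Counts copied from the "Grouping Method: Frequency / Rare variants" rows.
lof_counts <- matrix(
  c(20, 66,
    28, 53),
  nrow = 2, byrow = TRUE,
  dimnames = list(bin = c("[25,48)", "[48,92]"),
                  group = c("Affected", "Control"))
)

# Column proportions: share of each group that falls into each bin.
lof_props <- round(prop.table(lof_counts, margin = 2), 3)
print(lof_props)

# Two-sided Fisher's exact test on the 2x2 count table (assumed test setup).
ft <- fisher.test(lof_counts)
print(ft$p.value)
```

The column proportions reproduce the values listed under the counts (for example 20/48 = 0.417 for affected individuals in the lower bin).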

Loss-of-function ROC curve

Logistic Regression

term estimate p.value
(Intercept) -1386.014 0.996
fra_fr1 -39.094 0.994
spl_fr4 110.454 0.995
sto_fr2 -43.300 0.995
sto_fr3 -25.312 0.995
nfr_fr1 17.541 0.994
nfr_fr2 -43.654 0.995
nfr_fr3 -38.630 0.994
nfr_fr4 169.998 0.995
nsy_fr1 1.794 0.995
syn_fr1 -1.408 0.996
syn_fr3 4.300 0.994
syn_fr4 -9.714 0.996
utr5_fr3 10.648 0.996
utr3_fr1 0.064 0.995
utr3_fr3 -3.969 0.994
ncRNA_fr4 68.261 0.994
miRNA_fr1 37.627 0.996
miRNA_fr2 78.987 0.996
miRNA_fr3 190.749 0.994
miRNA_fr4 -420.030 0.998
bnd_fr2 -1.306 0.996
reg_fr2 19.270 0.996
reg_fr3 33.597 0.994
H AUC KS TP FP TN FN
scores 0.014 0.548 0.106 9 20 42 12
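
The TP, FP, TN and FN counts reported for each classifier can be turned into the usual rates. This is a small base R sketch; the helper name `confusion_rates` is hypothetical and not part of the original analysis.

```r
# Derive common rates from confusion-matrix counts (TP, FP, TN, FN).
confusion_rates <- function(tp, fp, tn, fn) {
  c(
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    # Precision is undefined when nothing is predicted positive.
    precision   = if (tp + fp > 0) tp / (tp + fp) else NA_real_,
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
  )
}

# Counts from the logistic regression "scores" row: TP=9, FP=20, TN=42, FN=12.
rates <- confusion_rates(tp = 9, fp = 20, tn = 42, fn = 12)
print(round(rates, 3))
```

For this row, sensitivity is 9/21 and specificity is 42/62, so the model recovers fewer than half of the affected individuals in the test set.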

Decision Tree

## 
## Classification tree:
## rpart(formula = fml, data = lrn)
## 
## Variables actually used in tree construction:
## [1] bnd_fr1   ncRNA_fr3 utr3_fr3 
## 
## Root node error: 27/84 = 0.32143
## 
## n= 84 
## 
##        CP nsplit rel error xerror    xstd
## 1 0.22222      0   1.00000 1.0000 0.15853
## 2 0.11111      2   0.55556 1.0741 0.16139
## 3 0.01000      4   0.33333 1.1481 0.16380
H AUC KS TP FP TN FN
scores 0.299 0.764 0.412 10 4 58 11
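
The complexity table printed by rpart is typically used to choose how far to prune the tree. The sketch below retypes the table above as a data frame and applies two common selection rules; it does not refit the model, and the report does not state whether any pruning was applied. With a fitted rpart object, the same columns would be available in its `cptable` component.

```r
# Complexity table copied from the printed rpart output above.
cp_table <- data.frame(
  CP        = c(0.22222, 0.11111, 0.01000),
  nsplit    = c(0, 2, 4),
  rel_error = c(1.00000, 0.55556, 0.33333),
  xerror    = c(1.0000, 1.0741, 1.1481),
  xstd      = c(0.15853, 0.16139, 0.16380)
)

# Rule 1: row with the lowest cross-validated error.
best <- cp_table[which.min(cp_table$xerror), ]
print(best)

# Rule 2 (one-standard-error rule): smallest tree whose xerror is within
# one xstd of the minimum. Rows are ordered by increasing nsplit.
threshold <- best$xerror + best$xstd
one_se <- cp_table[cp_table$xerror <= threshold, ][1, ]
print(one_se$nsplit)
```

For this table both rules select the row with `nsplit = 0`, because the cross-validated error rises as splits are added even though the relative (training) error falls.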

Neural Network

H AUC KS TP FP TN FN
V1 0 0.5 0 0 0 62 21
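
An AUC of 0.5 with KS of 0 and no positive predictions is what a model yields when it assigns the same score to every individual. The sketch below illustrates this in base R with hypothetical constant scores; the helper names `auc_rank` and `ks_stat` are illustrative, and the report's own metrics may have been computed with a different package.

```r
# AUC via the rank-sum (Mann-Whitney) identity; ties receive average ranks.
auc_rank <- function(scores, labels) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# KS statistic: largest gap between the score ECDFs of the two classes.
ks_stat <- function(scores, labels) {
  grid <- sort(unique(scores))
  max(abs(ecdf(scores[labels == 1])(grid) - ecdf(scores[labels == 0])(grid)))
}

# Hypothetical scores: one constant value for all 83 test individuals
# (21 affected and 62 controls, matching TP + FN and TN + FP above).
labels <- c(rep(1, 21), rep(0, 62))
scores <- rep(0.3, length(labels))

print(auc_rank(scores, labels))  # 0.5
print(ks_stat(scores, labels))   # 0
```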

Support Vector Machine

H AUC KS TP FP TN FN
scores 0.224 0.71 0.343 2 0 62 19

PCA with counting variants

PCA with raw vcf

## VCF Format ==> SNP GDS Format
## Method: exacting biallelic SNPs
## Number of samples: 179
## Parsing "ind179.vcf" ...
##  import 1213997 variants.
## + genotype   { Bit2 179x1213997, 51.8M } *
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
##     open the file 'final.gds' (60.9M)
##     # of fragments: 220
##     save to 'final.gds.tmp'
##     rename 'final.gds.tmp' (60.9M, reduced: 2.3K)
##     # of fragments: 20
## Principal Component Analysis (PCA) on genotypes:
## Excluding 26,624 SNPs on non-autosomes
## Excluding 3,634 SNPs (monomorphic: TRUE, < MAF: NaN, or > missing rate: NaN)
## Working space: 179 samples, 1,183,739 SNPs
##     using 1 (CPU) core
## PCA: the sum of all selected genotypes (0, 1 and 2) = 344168751
## Wed Jan 31 21:04:55 2018    (internal increment: 4184)
## 
## Wed Jan 31 21:05:05 2018    Begin (eigenvalues and eigenvectors)
## Wed Jan 31 21:05:05 2018    Done.

Cluster 1

LoF

Loss-of-function grouping

Number of LoF         Affected n  Control n  Affected prop.  Control prop.  Fisher p
Grouping Method: Frequency
  Rare variants
    [25,46)           19          50         0.500           0.515          1
    [46,92]           19          47         0.500           0.485
  Singletons
    [0, 8)            20          52         0.526           0.536
    [8,17]            18          45         0.474           0.464
Grouping Method: Presence versus Absence
  Rare variants
    0                 0           0          0.000           0.000
    [ 1,92]           38          97         1.000           1.000
  Singletons
    0                 0           1          0.000           0.010
    [ 1,92]           38          96         1.000           0.990
Grouping Method: Kmeans
  Rare variants
    [25.0,46.8)       20          56         0.526           0.577          0.7
    [46.8,92.0]       18          41         0.474           0.423
  Singletons
    [ 0.00, 7.64)     20          52         0.526           0.536          1
    [ 7.64,17.00]     18          45         0.474           0.464

Loss-of-function ROC curve

Logistic Regression

term estimate p.value
(Intercept) -1143.488 0.997
spl_fr1 -9.677 0.996
syn_fr1 -0.296 0.996
syn_fr2 -1.978 0.996
utr5_fr2 13.069 0.996
ncRNA_fr1 0.865 0.996
ncRNA_fr3 9.111 0.996
bnd_fr3 -2.706 0.996
reg_fr1 3.444 0.996
reg_fr2 -20.612 0.997
H AUC KS TP FP TN FN
scores 0.2 0.68 0.365 10 5 39 13

Decision Tree

## 
## Classification tree:
## rpart(formula = fml, data = lrn)
## 
## Variables actually used in tree construction:
## [1] ncRNA_fr3 reg_fr4  
## 
## Root node error: 15/68 = 0.22058824
## 
## n= 68 
## 
##     CP nsplit rel error    xerror       xstd
## 1 0.20      0       1.0 1.0000000 0.22794908
## 2 0.01      2       0.6 1.3333333 0.25048972
H AUC KS TP FP TN FN
scores 0.099 0.619 0.234 20 36 8 3

Neural Network

H AUC KS TP FP TN FN
V1 0 0.5 0 0 0 44 23

Support Vector Machine

H AUC KS TP FP TN FN
scores 0.475 0.801 0.605 0 0 44 23

Principal Components Analysis

Cluster 2

LoF

Loss-of-function grouping

Number of LoF         Affected n  Control n  Affected prop.  Control prop.  Fisher p
Grouping Method: Frequency
  Rare variants
    [35,61)           3           13         0.300           0.591          0.252
    [61,86]           7           9          0.700           0.409
  Singletons
    [2, 9)            4           15         0.400           0.682          0.244
    [9,13]            6           7          0.600           0.318
Grouping Method: Presence versus Absence
  Rare variants
    0                 0           0          0.000           0.000          1
    [ 1,86]           10          22         1.000           1.000
  Singletons
    0                 0           0          0.000           0.000
    [ 1,86]           10          22         1.000           1.000
Grouping Method: Kmeans
  Rare variants
    [35.0,61.8)       3           15         0.300           0.682          0.062
    [61.8,86.0]       7           7          0.700           0.318
  Singletons
    [ 2.00, 7.85)     4           15         0.400           0.682          0.244
    [ 7.85,13.00]     6           7          0.600           0.318

Loss-of-function ROC curve

Logistic Regression

term estimate p.value
(Intercept) 225.479 0.999
fra_fr1 -9.383 0.999
spl_fr1 17.046 0.999
spl_fr2 -31.320 0.999
sto_fr2 -29.732 0.999
nfr_fr3 -13.087 0.999
H AUC KS TP FP TN FN
scores 0.025 0.518 0.145 3 6 5 2

Decision Tree

## 
## Classification tree:
## rpart(formula = fml, data = lrn)
## 
## Variables actually used in tree construction:
## character(0)
## 
## Root node error: 5/16 = 0.3125
## 
## n= 16 
## 
##     CP nsplit rel error xerror xstd
## 1 0.01      0         1      0    0
H AUC KS TP FP TN FN
scores 0 0.5 0 5 11 0 0

Neural Network

H AUC KS TP FP TN FN
V1 0 0.5 0 0 0 11 5

Support Vector Machine

H AUC KS TP FP TN FN
scores 0.276 0.673 0.364 0 0 11 5

Principal Components Analysis

Only Exonic Regions

LoF

Loss-of-function grouping

Number of LoF         Affected n  Control n  Affected prop.  Control prop.  Fisher p
Grouping Method: Frequency
  Rare variants
    [25,48)           20          66         0.417           0.555          0.125
    [48,92]           28          53         0.583           0.445
  Singletons
    [0, 8)            24          67         0.500           0.563          0.495
    [8,17]            24          52         0.500           0.437
Grouping Method: Presence versus Absence
  Rare variants
    0                 0           0          0.000           0.000          1
    [ 1,92]           48          119        1.000           1.000
  Singletons
    0                 0           1          0.000           0.008
    [ 1,92]           48          118        1.000           0.992
Grouping Method: Kmeans
  Rare variants
    [25.0,52.6)       29          87         0.604           0.731          0.137
    [52.6,92.0]       19          32         0.396           0.269
  Singletons
    [ 0.00, 8.47)     29          87         0.604           0.731
    [ 8.47,17.00]     19          32         0.396           0.269

Loss-of-function ROC curve

Logistic Regression

term estimate p.value
(Intercept) 3.151 0.303
fra_fr3 0.078 0.088
spl_fr1 0.050 0.063
spl_fr4 0.291 0.158
sto_fr3 -0.159 0.128
sto_fr4 0.597 0.019
nfr_fr1 0.041 0.04
nfr_fr4 0.324 0.114
syn_fr1 -0.002 0.005
syn_fr2 -0.011 0.017
syn_fr3 0.017 0.048
H AUC KS TP FP TN FN
scores 0.223 0.716 0.408 10 14 48 11

Decision Tree

## 
## Classification tree:
## rpart(formula = fml, data = lrn)
## 
## Variables actually used in tree construction:
## [1] nsy_fr2 spl_fr3 sto_fr1 syn_fr3 syn_fr4
## 
## Root node error: 27/84 = 0.32142857
## 
## n= 84 
## 
##            CP nsplit  rel error    xerror       xstd
## 1 0.185185185      0 1.00000000 1.0000000 0.15853162
## 2 0.074074074      1 0.81481481 1.2222222 0.16578249
## 3 0.037037037      2 0.74074074 1.2592593 0.16661767
## 4 0.010000000      6 0.59259259 1.2962963 0.16735113
H AUC KS TP FP TN FN
scores 0.03 0.52 0.114 20 52 10 1

Neural Network

H AUC KS TP FP TN FN
V1 0 0.5 0 0 0 62 21

Support Vector Machine

H AUC KS TP FP TN FN
scores 0.231 0.706 0.33 1 0 62 20

Principal Components Analysis

Only Regulatory Regions

All regulatory Regions

All regulatory grouping

Number of LoF         Affected n  Control n  Affected prop.  Control prop.  Fisher p
Grouping Method: Frequency
  Rare variants
    [ 650,1494)       28          56         0.583           0.471          0.232
    [1494,2728]       20          63         0.417           0.529
  Singletons
    [ 62,175)         25          60         0.521           0.504          0.866
    [175,258]         23          59         0.479           0.496
Grouping Method: Presence versus Absence
  Rare variants
    0                 0           0          0.000           0.000          1
    [ 1,2728]         48          119        1.000           1.000
  Singletons
    0                 0           0          0.000           0.000
    [ 1,2728]         48          119        1.000           1.000
Grouping Method: Kmeans
  Rare variants
    [ 650,1481)       28          54         0.583           0.454          0.171
    [1481,2728]       20          65         0.417           0.546
  Singletons
    [ 62,146)         9           33         0.188           0.277          0.245
    [146,258]         39          86         0.812           0.723

Regulatory ROC curve

Logistic Regression

term estimate p.value
(Intercept) -7.029 0.005
utr5_fr1 -0.004 0.032
utr5_fr2 0.058 0.034
utr3_fr3 -0.013 0.007
ncRNA_fr1 0.003 0.145
ncRNA_fr3 0.028 0.002
ncRNA_fr4 0.152 0.064
miRNA_fr3 0.972 0.02
miRNA_fr4 -3.421 0.016
bnd_fr1 0.001 0.053
bnd_fr2 -0.009 0.031
reg_fr1 -0.027 0.126
reg_fr3 0.132 0.076
H AUC KS TP FP TN FN
scores 0.108 0.627 0.234 7 12 50 14

Decision Tree

## 
## Classification tree:
## rpart(formula = fml, data = lrn)
## 
## Variables actually used in tree construction:
## [1] bnd_fr1   ncRNA_fr3 utr3_fr3 
## 
## Root node error: 27/84 = 0.32142857
## 
## n= 84 
## 
##           CP nsplit  rel error     xerror       xstd
## 1 0.22222222      0 1.00000000 1.00000000 0.15853162
## 2 0.11111111      2 0.55555556 0.70370370 0.14201364
## 3 0.01000000      4 0.33333333 0.74074074 0.14457779
H AUC KS TP FP TN FN
scores 0.299 0.764 0.412 10 4 58 11

Neural Network

H AUC KS TP FP TN FN
V1 0 0.5 0 0 0 62 21

Support Vector Machine

H AUC KS TP FP TN FN
scores 0.285 0.752 0.441 2 2 60 19

Principal Components Analysis